ABSTRACT



Classification of antibody complementarity-determining region (CDR) conformations
is an important step that drives antibody modelling and engineering, prediction
from sequence, directed mutagenesis and induced-fit studies, and allows inferences
on sequence-to-structure relations. Most previous work performed conformational
clustering on a reduced set of structures or after applying various structure
pre-filtering criteria. In this study, it was judged that clustering every
available CDR conformation would produce a complete and redundant repertoire,
increase the number of sequence examples and allow better decisions on structure
validity in the future. In order to cope with the potential increase in data noise, a
first-level statistical clustering was performed using structure superposition
Root-Mean-Square Deviation (RMSD) as the distance criterion, coupled with second- and
third-level clustering that employed Ramachandran regions for a deeper qualitative
classification. The classification of a total of 12,712 CDR conformations is thus
presented, along with rich annotation and cluster descriptions, and the results
are compared to previous major studies. The present repertoire provides an
improved picture of the current CDR knowledge base, with a novel nesting of
conformational sensitivity and specificity that can serve as a systematic framework for
improved prediction from sequence, as well as for a number of future studies that would
aid knowledge-based antibody engineering such as humanisation.



Subjects Bioinformatics, Computational Biology, Molecular Biology, Immunology 
Keywords Antibody structure, Canonical model, CDR conformation, Dynamic hybrid tree-cut, 
INTRODUCTION

Antibodies achieve the recognition and binding of antigens mainly by variation in the
length and sequence of six loops called complementarity-determining regions (CDRs),
three in the Light chain (CDR-L1, -L2, -L3) and three in the Heavy chain (CDR-H1, -H2,
-H3). Early comparison of the experimental data suggested that CDRs usually adopt one
of a limited number of possible conformations, depending on the presence of a few key
residues in the sequence. This observation gave rise to the canonical model in which the
three-dimensional conformation (or canonical class) of the corresponding loop could be
predicted from sequence templates for five of the six CDRs (Chothia et al., 1986; Chothia
et al., 1989; Chothia et al., 1992; Chothia & Lesk, 1987). Since this initial classification,
further analysis has revealed novel classes, improved the predictability of the known
ones, and offered insights into antigen recognition and binding mechanisms (Martin &
Thornton, 1996; Al-Lazikani, Lesk & Chothia, 1997). Later, a number of studies (Shirai,
Kidera & Nakamura, 1996; Shirai, Kidera & Nakamura, 1999; Furukawa et al., 2001; Kuroda
et al., 2008) provided structure-determining sequence rules for the prediction of the base
conformation of the sixth and final CDR-H3.

Today, the increasing amount of new structural data presents an opportunity not 
only to improve the accuracy of conformational prediction from sequence alone, by 
identifying novel classes and reassessing the known ones; but also to study the basis of 
loop folding and gain insights into subtle antibody/antigen interactions. Steps are being 
taken in this direction that will enhance the capabilities of knowledge-based antibody 
engineering, e.g., humanization (Saldanha, 2009) and assist attempts at de novo antibody
design (Yu et al., 2012). In this study, an updated repertoire of CDR conformations
was acquired by clustering and analysis of all available antibody loop structures. The
primary goal was to create a complete repository of the redundant CDR conformational
repertoire that is observed and deposited in the Protein Data Bank (PDB, Berman et al.,
2000), i.e., obtain a classification for every single CDR, regardless of quality or sequence
redundancies. This would allow a number of better informed, dedicated analyses regarding
sequence-to-structure relations, induced fit, structural consistency, mutation studies or
more targeted thermodynamic simulations. Most previous work was conducted when only
a limited number of structures were available (Chothia et al., 1989; Martin & Thornton,
1996; Bane et al., 1994; Rees et al., 1994; Reczko et al., 1995; Tomlinson et al., 1995; Morea et
al., 1997; Guarne et al., 1996; Morea et al., 1998; Morea, Lesk & Tramontano, 2000; Oliva
et al., 1998), or only specific CDRs were targeted for clustering (Kuroda et al., 2009;
Teplyakov & Gilliland, 2014), or the selected datasets were heavily filtered in order to
avoid redundancies and the inclusion of potentially wrong structures (North, Lehmann
& Dunbrack, 2011). The automatically updated online repertoire AbYsis is maintained at
http://www.bioinf.org.uk/abysis; however, it does not annotate the redundant CDR content.
In contrast, the very recently released CDR structural database SAbDab (Dunbar et al.,
2014) does contain the redundant CDR repertoire, but the characteristics of the clustering
method employed are very different from the present work, as indicated later.

A strategic decision was made to include all redundant CDR conformations, especially 
those from the same antibody presented in different PDB structure files and those from 
multiple copies of the same antibody variable chain within the same PDB file. Previous 
experience with examining CDR conformations suggested that different structures or 
copies of the same CDR may reveal its conformational flexibility, which is a useful aspect 
for molecular modellers and biologists who study the antigenic interface. By randomly 
selecting only one structure file and one variable chain copy of a given CDR, there is the 
risk of picking a non-representative instance which is different from the CDR's average 
conformation, or picking a structure that contains errors or invasive crystal packing. 
Furthermore, random selection also removes from the dataset the possibility of observing
an antibody in both its free and bound state, wherever this is available. Finally, it was 
judged that a poor average crystallographic resolution does not a priori point to a wrong 
structure and that a corresponding pre-filtering would potentially prevent the inclusion of 
new conformations in the repertoire. 

The second goal was to take advantage of all antibody structural information in order to 
create CDR clusters that can lead to advancement in the area of conformational prediction 
from sequence alone (Nikoloudis, Pitts & Saldanha, 2014). The enrichment of the cluster 
populations (CDRs with the same or similar conformations) with as many examples as
possible is crucial for establishing connections between sequence and structure.
The present analysis aimed to serve as a preliminary framework not only by producing an 
updated conformational dataset, but also by creating a novel nested clustering architecture 
that is more beneficial for prediction from sequence alone. Specifically, the nested 
repertoire tries to optimise the trade-off between the proliferation of sequence examples 
and a possible detrimental effect from small structure-solving errors. 

By including all available CDR structures in the dataset, any conclusions on conformational
validity were shifted to the post-clustering stage of analysis. However, at the same
time there is an increase in noise of the dataset and as a consequence it was expected 
that the extents of some of the natural conformational clusters could be distorted or 
overlapping. These characteristics were taken into consideration in the design of the 
clustering steps in order to optimise the cluster separation, while minimising the loss 
of cluster specificity and/or sensitivity. The clustering procedure itself should help with 
the assessment of conformational validity and act as a first filter by efficiently excluding 
outliers from the natural clusters. 

METHODS 

Acquisition of antibody structure files 

The three-dimensional coordinates of all antibody structures were downloaded from 
the PDB (Berman et al., 2000). Since the presence of antibody variable chains inside a
PDB file is not annotated in a unique and systematic way, the advanced search tool of the 
database was used in order to apply composite search filters. The simple text search query 
of the database with the keywords "antibody" or "immunoglobulin" returns hundreds of 
unwanted PDB files, for example those that only contain a constant antibody fragment 
(Fc) or those that contain the keyword in their primary citation without any relevant 
structures in the file. Conversely, in several cases, antibody variable chains (Fv) are found 
in PDB files that do not contain the keywords "antibody" or "immunoglobulin" at all. In 
order to refine the obtained results, multiple queries were run using a variety of relevant 
keywords and their combinations with appropriate logical AND/OR/NOT connectors. The 
keywords employed typically included: "antibody", "immunoglobulin", "Fab", "Fv", "Fc",
"light chain", "heavy chain", "intact", "complete", "camelid", "llama", "VHH", "light dimer"
and "Bence-Jones".
Table 1 Summary of clustering dataset contents. Total clustered members per CDR include outliers and
singletons.

Total PDB files                                                           1,351
Files containing structures from two antibodies/idiotype-anti-idiotypes   8/5
Total antibody structures                                                 1,359
Total number of CDRs                                                      13,086
CDRs with missing Cα coordinates                                          374
Total clustered CDRs                                                      12,712
CDR-L1 clustered                                                          2,155
CDR-L2 clustered                                                          2,174
CDR-L3 clustered                                                          2,164
CDR-H1 clustered                                                          2,057
CDR-H2 clustered                                                          2,130
CDR-H3 clustered                                                          2,032
Total non-redundant CDR sequences                                         2,827
PDB files with lambda isotypes                                            194
Heavy only                                                                77
Light only                                                                78
PDB files with bound antibodies                                           673


The final dataset comprised 1,351 PDB structure files, 8 of which contain
variable chains from two different antibodies (5 were idiotype-anti-idiotype complexes),
increasing the total number of antibody structures to 1,359. The total number of included
CDRs is 12,712, of which 2,827 are unique in sequence. Table 1 contains a summary of the
dataset contents. The dataset was locked on 31 December 2011 and should reflect
the complete repertoire of antibody CDR structures up to that date, with the
proviso that some files may have been missed owing to the lack of specific tagging or
annotation in the PDB.

Numbering of antibody variable chains and definition of CDR extents

All the antibody variable chain sequences in the dataset were structurally numbered in
order to detect the beginning and end of each CDR, using regular expressions to
locate conserved sequence patterns. The initially adopted numbering
scheme was the Chothia scheme (Chothia & Lesk, 1987) because it correctly places the
insertion points in CDR-L1 and CDR-H1, but also because it is very frequently used in
the CDR-related literature. The definitions used for the extents of CDRs-L1, -L2, -L3
and -H3 were also those established by Chothia & Lesk (1987) because they are most
commonly used. However, for CDR-H1 and CDR-H2, the definitions adopted were those
used in North, Lehmann & Dunbrack (2011). Based on previous experience from the
visual examination of CDR-H1 structural superpositions, it was noted that the N-terminal
portion of the loop where Kabat's (Kabat et al., 1991) and Chothia's CDR-H1 differ shows
great variability both in sequence and structure. Thus, it was judged that this cluster
analysis would be more revealing and useful if the CDR-H1 extent was considered as
the entire length of the loop, namely residues H23-H35. As far as CDR-H2 was concerned,
it was observed that the C-terminal portion of Kabat's definition (i.e., residues H59-H65)
remained relatively unchanged conformationally in most CDRs. Therefore, only the length
of the symmetrical loop portion between residues H50-H58 was retained for the CDR-H2
definition.

Figure 1 Superposition of 7-residue and 11-residue CDR-L2. The 5 C-terminal residues (L52-L56) of
the 7-residue CDR-L2 of 1A4K (in red) are superposed onto the equivalent portion of the 11-residue
CDR-L2 of 3FFD (in blue). Position L51 is highlighted in green, as the best insertion point in the
structural numbering scheme. Graphics created with Swiss-Pdb Viewer (http://www.expasy.org/spdbv/).
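
The regular-expression approach to locating CDR boundaries can be illustrated with a minimal Python sketch. The pattern below is hypothetical and is not the expression used in this study: it assumes the commonly quoted rule of thumb that CDR-L1 is preceded by a conserved Cys and followed by a Trp-initiated motif such as WYQ, and the input is an illustrative light-chain fragment chosen only for the example.

```python
import re

# Hypothetical pattern (an assumption, not the study's actual expression):
# a conserved Cys, then 10-17 loop residues, then a Trp-initiated motif.
CDR_L1_PATTERN = re.compile(r"C([A-Z]{10,17}?)(?=W[YLF][QL])")


def find_cdr_l1(sequence):
    """Return (start, end, loop) of the first CDR-L1-like match, or None.

    start and end are 0-based, end-exclusive indices into `sequence`.
    """
    match = CDR_L1_PATTERN.search(sequence)
    if match is None:
        return None
    return match.start(1), match.end(1), match.group(1)


# Illustrative light-chain variable fragment (toy input for this sketch).
light_chain = "DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKLLIY"
hit = find_cdr_l1(light_chain)
print(hit)
```

In this toy case the 1-based positions of the matched loop coincide with the L24-L34 extent, because the fragment carries no insertions or deletions before the loop; a real numbering routine would also have to anchor the remaining CDRs and handle insertion positions.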

CDR length and numbering scheme amendments 

A number of antibodies contained a CDR with more residues than the current scheme
could accept. The CDRs concerned were CDR-L2, -L3, -H1, -H2 and -H3. These CDRs,
except for CDR-L2, already contained an insertion locus so the maximum allowed length
was extended by adding more insertion positions (letters) to the numbering scheme. An
insertion point was required in CDR-L2 for an 11-residue length. Superposing the
new 11-residue loop (PDB code 3FFD) on a typical 7-residue one (1A4K) strongly
suggested that the insertion point in CDR-L2 should be placed at position L51 (Fig. 1).

Two more cases required intervention in the numbering scheme. The first was in 
Light chain framework-3 (LFR3), where structure 1PW3 showed a 2-residue insertion. 
Superposition of this structure to the respective portion of a typical Light variable chain 
(1A4K) revealed that an insertion point should be introduced at position L67 (Fig. 2). 
The second case was raised by two anti-HIV antibodies observed in structures 3RPI and 
3SE8, showing an insertion of 3 and 7 residues respectively in Heavy chain framework-3 
(HFR3). Superposition of these frameworks onto a typical HFR3 (3MLY) suggested that 
an insertion point should be placed at residue H74 (Fig. 3). Table 2 summarises all the 
amendments brought to the initial numbering scheme in order to accommodate the 
special cases discovered in the dataset. 

Clustering overview 

In order to increase the usefulness of the clustering result in a way that meets the needs of
a wider range of applications, a novel three-level nested cluster architecture was devised.
At the parent-level, members of the same cluster share the least similarity in terms of
Cα-atom Root-Mean-Square Deviation (RMSD), as the cluster is designed to include all
the variants of a conformational theme within the limits of a statistical cluster validation.
At the daughter-level, RMSD variance is successively reduced and members of the same
cluster are increasingly similar. This stratified scheme could also be perceived as a variation
of sensitivity to the potential natural flexibility of a CDR conformation (looser clusters), as
well as a trade-off to the specificity of a particular shape (tighter clusters).

Figure 2 Superposition of Light Framework 3 with an insertion onto a typical LFR3. Residues L60-L75
of crystal structure 1PW3 (in red), containing an insertion, are superposed onto a typical example of the
equivalent Light chain fragment (here 1A4K, in blue). The new insertion point was introduced in position
L67 (highlighted in green). Graphics created with Swiss-Pdb Viewer (http://www.expasy.org/spdbv/).

Figure 3 Superposition of Heavy Framework 3 with an insertion onto a typical HFR3. The Cα-trace
of a two-leg superposition of residues H65-H73 and H76-H78 of crystal structures 3RPI (in yellow)
and 3SE8 (in red), containing an insertion, onto the equivalent residues of a typical structure without
an insertion (here 3MLY, in blue). The proposed insertion point H74 is highlighted in green in 3MLY
and is shown with its side chain (Ser). Graphics created with Swiss-Pdb Viewer
(http://www.expasy.org/spdbv/).



First-level clusters were formed by the use of a statistical clustering method, while
second- and third-level clusters were defined using qualitative criteria. More specifically,
the data was initially analysed by average- and complete-distance hierarchical clustering
using RMSD distance matrices, and pruning of the resulting trees was performed with
the Dynamic Tree Cut algorithm (Langfelder, Zhang & Horvath, 2007). RMSD distance
matrices were obtained by performing all-by-all Cα-atom superpositions of the entire
CDR loops, per individual CDR length. The result of hierarchical clustering was a set of
level-1 structural classes, as traditionally produced by various methods in all previous CDR
conformational studies, meaning that members of the same cluster were similar to a degree
that is defined by the tree-pruning and clustering criteria.



Nikoloudis et al. (2014), PeerJ, DOI 10.7717/peerj.456


Table 2 Modifications brought to the numbering scheme. Modifications were made in the light of
new and atypical sequences. LFR3, light chain framework 3; HFR3, heavy chain framework 3. CDR-H3
insertion positions H100uvw were not required in the present dataset, but were added for the technical
continuity up to the pre-existing positions H100xyz and for future use. Thus 3U1S has a CDR-H3 length
of 31 residues.

Locus    Numbering scheme   Maximum      Structures with the                        CDR
         addition           CDR length   new maximum length                         extents
CDR-L1   -                  17           N/A                                        L24-L34
CDR-L2   L51abcd            11           2GSG, 2H32, 2H3N, 2OTU, 2OTW, 2QHR, 3FFD   L50-L56
LFR3     L67ab              N/A          1PW3                                       N/A
CDR-L3   L95cd              13           2GSG, 2OTU, 2QHR, 3FFD, 3MLW               L89-L97
CDR-H1   H31cdefghijk       24           3K3Q                                       H23-H35
CDR-H2   H52ef              15           3TWC, 3TYG                                 H50-H58
HFR3     H74abcdefg         N/A          3SE8                                       N/A
CDR-H3   H100nopqrstuvw     34           3U1S                                       H95-H102



Subsequently, φ/ψ angles were calculated for all CDR residues, each residue was
attributed to a Ramachandran region and Ramachandran logos were formulated for
each CDR. For practical and computational reasons, the boundaries of the different
Ramachandran regions were based on the Ramachandran Plot subdivision used by North,
Lehmann & Dunbrack (2011) (Fig. 4). Two types of Ramachandran logos are defined for
each CDR, namely one where similar conformational regions were represented by the same
letter (also suggested in North, Lehmann & Dunbrack, 2011), which will henceforth be
called the reduced-Ramachandran Logo or r-RL, and one where every conformational
region is represented individually, called the full-Ramachandran Logo or f-RL. For
the formation of level-2 clusters, the members of any given parent level-1 cluster were
regrouped by identical r-RL, meaning that members of the same cluster contain residues
at each CDR position that belong to similar conformational regions. For the formation
of level-3 clusters, the members of any given level-2 cluster were regrouped by identical
f-RL, meaning that members of the same cluster contain residues at each CDR position
that belong to the exact same conformational region. An example showing the layout of
this nested cluster architecture can be seen in Fig. 5. Outliers/singletons were all given the
tag '-O-' in their conformational logo, which created a common parent class that allowed
the subsequent formation of 2nd- and 3rd-level clusters within outlier space, as well.
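
The regrouping by Ramachandran logo can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: it applies the region merging given in the Fig. 4 caption, (A/D) = A, (B/P) = B, (L/G) = L, and nests members as level-1 cluster, then r-RL, then f-RL. The member names, the level-1 label and all logos except BBDABPPPB are hypothetical, and the '-O-' outlier tag is not handled.

```python
from collections import defaultdict

# Region merging stated in the Fig. 4 caption: (A/D) = A, (B/P) = B, (L/G) = L.
REDUCED_REGION = {"A": "A", "D": "A", "B": "B", "P": "B", "L": "L", "G": "L"}


def reduce_logo(full_logo):
    """Convert a full-Ramachandran logo (f-RL) into its reduced form (r-RL)."""
    return "".join(REDUCED_REGION[region] for region in full_logo)


def nest_clusters(members):
    """Group (name, level1_label, f_RL) records as level-1 > r-RL > f-RL."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for name, level1, full_logo in members:
        tree[level1][reduce_logo(full_logo)][full_logo].append(name)
    return tree


# Hypothetical members of one level-1 cluster (names and label are made up).
members = [
    ("loop_1", "L3-9-I", "BBDABPPPB"),
    ("loop_2", "L3-9-I", "BBAABPPPB"),
    ("loop_3", "L3-9-I", "BBDABPPPB"),
    ("loop_4", "L3-9-I", "BBAALBPPB"),
]
tree = nest_clusters(members)
for r_rl, level3 in tree["L3-9-I"].items():
    print(r_rl, {f_rl: names for f_rl, names in level3.items()})
```

Here the first three loops share one level-2 cluster because their reduced logos are identical, but they split into two level-3 clusters because their full logos differ at the third position.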

Clustering method 

The RMSD distance matrices produced for each CDR/length combination were used for
hierarchical analysis in the statistical package RGui (GNU project,
http://www.sciviews.org/_rgui/). The average-linkage and complete-linkage algorithms were
preferred to single-linkage in order to avoid chaining effects in dense configurations of the
dataset in conformational space, and were both explored for every CDR/length combination.
Hierarchical trees (dendrograms) that gave a Cophenetic Correlation Coefficient (CPCC)
lower than 0.6 were directly discarded as pointing to poor fitting of the data. In all
cases at least one of the hierarchical methods achieved a CPCC score greater than 0.6.
Both hierarchical trees were considered whenever the CPCC was acceptable and
comparatively evaluated using the criteria below.

Figure 4 Ramachandran plot divided into conformational regions. A: α-helix region; B: β-sheet
region; D: δ-region; G: γ-region; L: left-handed helix region; P: polyproline II region. For the
construction of reduced-Ramachandran logos (r-RL), residues belonging to regions with similar
conformations were represented by the same letter: (A/D) = A, (B/P) = B, (L/G) = L. For the construction
of full-Ramachandran logos (f-RL), each conformational region was represented individually. E.g.,
Ramachandran logos for CDR-L3 1TJH_L: r-RL: BBAABBBBB; f-RL: BBDABPPPB.

The Dynamic Hybrid Tree Cut method of the Dynamic Tree Cut statistical package in
RGui was utilised for dendrogram pruning. The package has been previously successfully
used for the detection of biologically meaningful clusters in a protein-protein interaction
network in Drosophila (Dong & Horvath, 2007). The Dynamic Hybrid Tree Cut algorithm
offers flexibility, by allowing the user to set the desired pruning parameters for cluster
and outlier recognition. Specifically, the algorithm defines four cluster shape criteria:
(1) the minimum number of cluster members (N0, minClusterSize), (2) the maximum
scatter of the pairwise distances between the lowest merged objects (CDR structures) in
each cluster, called the cluster core (dmax, maxAbsCoreScatter), (3) the maximum joining
height at which a cluster attaches to the rest of the dendrogram (hmax, cutHeight), and
(4) the minimum distance between the core scatter and the joining height of a cluster to
the dendrogram, called the cluster gap (gmin, minAbsGap). The core scatter is defined as the
average of all pairwise dissimilarities between objects belonging to the core of the cluster.
Consequently, a branch is considered a cluster when it contains a minimum number of
members (N0), its joining height is at most hmax, its core is tightly connected (dmax) and
distinct from its neighbourhood (gmin). Specifically, the minimum cluster gap distance
(gmin) can be perceived as the minimum allowance for the cluster to expand its diameter
from its core until it reaches a neighbouring cluster.

Figure 5 Example of the nested clusters architecture. Level-1 cluster H1-13-III (i.e., the third top-level
cluster of 13-residue CDR-H1), defined by RMSD-based hierarchical clustering, contains 3 Level-2
clusters, the members of each sharing the same reduced-Ramachandran logo, and in total 11 Level-3
clusters, the members of each sharing the same full-Ramachandran logo. All Level-3 clusters share the
same reduced-Ramachandran logo with their parent Level-2 cluster, but each one displays a distinct
full-Ramachandran logo.

Although these pruning parameters are explained in depth in the corresponding 
method paper (Dong & Horvath, 2007), an example of the application of pruning 
parameters to an actual dendrogram from this analysis can be seen in Fig. 6. The number of 
objects assigned to the core of a cluster is derived from the following implemented formula: 

$n_c = \min\{N_0/2 + \sqrt{N - N_0/2},\ N\}$    (1)

with n_c the number of core objects, N_0 the defined minimum cluster size and N the total
number of objects in the cluster. As a consequence, the core of small clusters can be as large 
as the whole cluster, while the core of large clusters remains a fraction of the lowest joined 
objects. 
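
The behaviour described for small and large clusters can be checked with a short Python sketch. It assumes that Eq. (1) has the form n_c = min(N0/2 + sqrt(N - N0/2), N); rounding down to a whole number of objects is a further assumption of this sketch, and the function name is hypothetical.

```python
import math


def core_size(cluster_size, min_cluster_size):
    """Number of core objects, assuming n_c = min(N0/2 + sqrt(N - N0/2), N)."""
    base = min_cluster_size / 2
    if cluster_size <= base:
        return cluster_size
    # Rounding down to a whole number of objects is an assumption of this sketch.
    return min(int(base + math.sqrt(cluster_size - base)), cluster_size)


# With the minimum cluster size of 2 used in this study:
for n in (2, 10, 100):
    print(n, core_size(n, min_cluster_size=2))
```

Under these assumptions a 2-member cluster is entirely core, whereas a 100-member cluster has a core of about one tenth of its members.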
Figure 6 Illustration of the parameters taken into account for the dendrogram pruning of CDR-L1/12 residues with the Dynamic Hybrid method.
The minimum gap statistic (gmin) defines the minimum required distance between the average core scatter and the joining height of the clusters 
('Gap'), for successful cluster formation. In this example, gmin is set lower than the displayed Gaps, so nodes above its value were considered as 
different clusters. 



The algorithm examines the dendrogram in a bottom-up manner and attempts to 
perform three types of branch merges: a merge of two singletons which creates a new 
branch, the addition of a singleton to a branch, or a merge of two branches. In each 
step two branches are tested against the pruning criteria: if both considered branches 
satisfy the criteria then both are declared "closed" and no further objects are added in 
the current step. Otherwise, the branches are merged and this new group is reassessed for 
cluster conformity during the next merge with an adjacent branch. Objects too far from a 
cluster are left unlabelled as outliers. Once all possible object assignments are performed, 
the method allows a further optional 'Partitioning Around Medoids-like' step (PAM). 
During this step, unlabelled objects (outliers) are considered one-by-one and are assigned 
to existing clusters based on a user-defined maximum allowable distance, or when their 
distance is smaller than the cluster's radius. There are two options available for the cluster 
radius definition (parameter: useMedoids[=FALSE/TRUE]). If average distances are being 
used (FALSE), then the radius of the cluster is defined as the maximum of the average 
distances between objects in the cluster. If instead medoids are used (TRUE), then the 
radius is defined as the maximum distance of the cluster's medoid to the cluster's objects. 
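
The two radius definitions can be compared on a toy distance matrix. The Python sketch below is illustrative only and does not reproduce the package's implementation; the function names and the example distances are hypothetical, and measuring an outlier's distance to a cluster as its average distance to the members is an assumption.

```python
def cluster_radius(dist, members, use_medoids=False):
    """Radius of a cluster from a symmetric distance matrix (list of lists).

    use_medoids=False: maximum over members of the average distance to the
    other members. use_medoids=True: maximum distance from the medoid.
    """
    if len(members) < 2:
        return 0.0
    if use_medoids:
        medoid = min(members, key=lambda i: sum(dist[i][j] for j in members))
        return max(dist[medoid][j] for j in members)
    averages = [
        sum(dist[i][j] for j in members if j != i) / (len(members) - 1)
        for i in members
    ]
    return max(averages)


def within_radius(dist, outlier, members, use_medoids=False):
    """True if the outlier's average distance to the members is below the radius."""
    # Using the average distance to the members is an assumption of this sketch.
    average = sum(dist[outlier][j] for j in members) / len(members)
    return average < cluster_radius(dist, members, use_medoids)


# Toy RMSD-like distances: objects 0-2 form a cluster, object 3 is unlabelled.
dist = [
    [0.0, 1.0, 2.0, 1.2],
    [1.0, 0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0, 1.4],
    [1.2, 1.0, 1.4, 0.0],
]
print(cluster_radius(dist, [0, 1, 2]), cluster_radius(dist, [0, 1, 2], True))
print(within_radius(dist, 3, [0, 1, 2]), within_radius(dist, 3, [0, 1, 2], True))
```

In this example the average-distance radius (1.5) admits the unlabelled object while the medoid-based radius (1.0) does not, which shows why the choice of definition can change the number of remaining outliers.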

In order to detect the pruning parameters that lead to the best clustering result, an
R routine was created which cycles the pruning method through a range of hmax, then
gmin, then dmax using 0.1 increment steps. In each step, the quality of the clusters was
assessed by calculation of the average Silhouette Coefficient (SC) and a cut-off of 0.51 was
defined as the minimum required coefficient value for a reasonable structure to be found.
The minimum number of members per cluster (N0) was set to 2, in order to make sure
that true singletons that could not form a compact cluster core with sufficient separation
from neighbouring clusters were left as outliers. The output of this routine returned the
clustering parameters, the number of clusters and outliers, the average SC and an auxiliary
index showing the ratio of outliers over clusters.
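
The silhouette-based quality check applied at each step can be sketched as follows. This is a minimal Python illustration, not the R routine of the study: it computes individual silhouette widths directly from a distance matrix and tests the 0.51 average cut-off together with the requirement of all-positive widths. Skipping unlabelled outliers in the calculation is an assumption of the sketch, and the example data are hypothetical.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of each labelled object from a distance matrix.

    labels[i] is a cluster label, or None for an outlier (skipped here,
    which is an assumption of this sketch).
    """
    clusters = {}
    for index, label in enumerate(labels):
        if label is not None:
            clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette widths need at least two clusters")
    widths = {}
    for label, members in clusters.items():
        others = [m for other, m in clusters.items() if other != label]
        for i in members:
            if len(members) == 1:
                widths[i] = 0.0  # usual convention for single-member clusters
                continue
            # a: mean distance to own cluster; b: mean distance to nearest other.
            a = sum(dist[i][j] for j in members if j != i) / (len(members) - 1)
            b = min(sum(dist[i][j] for j in m) / len(m) for m in others)
            widths[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return widths


def passes_quality(widths, sc_cutoff=0.51):
    """Average SC at or above the cut-off and all individual widths positive."""
    average_sc = sum(widths.values()) / len(widths)
    return average_sc >= sc_cutoff and all(w > 0 for w in widths.values())


# Hypothetical one-dimensional example: two tight, well-separated pairs.
positions = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in positions] for p in positions]
widths = silhouette_widths(dist, ["I", "I", "II", "II"])
print(widths, passes_quality(widths))
```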

Multidimensional scaling was applied to all distance matrices and 2D maps were 
produced for visual inspection of the clusters. In addition, 3D maps were created and 
consulted through the visualisation tool GNUPLOT (Williams et al., 2007-2011), for better
perception of the configuration of the global population of each CDR/length combination. 
The 2D/3D maps and the respective Silhouette Plots of pruning results with average SC 
greater than 0.51 and all positive individual Silhouette Widths (SW) were consulted in 
all cases in order to continually have a visual appreciation of the data configuration and 
clustering evolution, and to make informed decisions which allowed the final formalisation 
of the clustering procedure. Given that the desired clustering result would ideally produce 
as many well separated clusters and as few outliers as possible, the auxiliary index offered a 
quick composite comparison between pruning results, and was defined as:

$a = (1 + S)/C$    (2)

where S is the number of outliers/singletons and C the number of clusters. The unit (1) was
added to the index's numerator in order to allow the comparison between pruning results
with 0 outliers/singletons, but a different number of clusters.

Another index employed during the clustering procedure was that of the ideal
maximum cluster diameter (D_ideal, in Å), which took into consideration the examined
CDR length (l):

$D_{ideal} = 1.0 + 0.1\,(l - 9)$    (3)

The rationale behind this formula was to define an ideal maximum diameter by adding
or subtracting 0.1 Å per residue respectively above or below a length of 9. For a CDR with
9 residues, this diameter was set empirically at 1.0 Å, based on experience of manual 3D
superpositions of CDR-L3/9-residues with the graphics program Swiss-Pdb Viewer (Spdbv;
Guex & Peitsch, 1997). Observations suggested 1.0 Å to be an appropriate cut-off for
significant visual conformational similarity for CDRs of this length. This auxiliary index
played no further analytical role than to define a cut-off at which the possibility
of cluster splitting was to be explored during the clustering procedure. In no case did
it impose a diameter threshold for cluster formation. Conversely, cluster merging was
explored between clusters that contained one or more members with greater affinity for
the second cluster (revealed by a negative SW). If the merge resulted in a global average
SC > 0.51 then it was retained, otherwise the entire partition was discarded. In the end,
the preferred clustering parameters were those that resulted in a global average SC equal to
or higher than 0.51, all positive individual SWs and the lower auxiliary index a (Eq. (2)).
If the number of outliers remained high, the optional PAM-stage was applied at the end of 
the tree cut procedure, but its results were only retained if all of the above partition quality 
criteria were satisfied. 
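
The selection logic described above can be summarised in a short Python sketch. It is an illustration under stated assumptions rather than the routine used in the study: the auxiliary index follows Eq. (2), the ideal maximum diameter is written from the stated rationale (1.0 Å at 9 residues, plus or minus 0.1 Å per residue), and the dictionary keys and the example pruning results are hypothetical.

```python
def auxiliary_index(n_outliers, n_clusters):
    """Eq. (2): a = (1 + S) / C; lower values are preferred."""
    return (1 + n_outliers) / n_clusters


def ideal_max_diameter(cdr_length):
    """Ideal maximum diameter in angstroms: 1.0 at length 9, +/- 0.1 per residue."""
    return 1.0 + 0.1 * (cdr_length - 9)


def pick_best(results, sc_cutoff=0.51):
    """Keep partitions meeting the quality criteria; return the lowest-index one."""
    accepted = [
        r for r in results
        if r["avg_sc"] >= sc_cutoff and r["min_sw"] > 0
    ]
    if not accepted:
        return None
    return min(accepted, key=lambda r: auxiliary_index(r["outliers"], r["clusters"]))


# Hypothetical pruning results for one CDR/length combination.
results = [
    {"name": "p1", "avg_sc": 0.62, "min_sw": 0.05, "outliers": 4, "clusters": 5},
    {"name": "p2", "avg_sc": 0.58, "min_sw": 0.10, "outliers": 1, "clusters": 4},
    {"name": "p3", "avg_sc": 0.70, "min_sw": -0.02, "outliers": 0, "clusters": 6},
    {"name": "p4", "avg_sc": 0.49, "min_sw": 0.20, "outliers": 0, "clusters": 3},
]
best = pick_best(results)
print(best["name"], auxiliary_index(best["outliers"], best["clusters"]))
print(ideal_max_diameter(9), ideal_max_diameter(12))
```

Note that p3 is rejected despite its high average SC because one member has a negative silhouette width, which mirrors the requirement that all individual SWs be positive.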

When the optimal clustering result was obtained, the clusters' cores, medoids, most 
distant members and their diameters were extracted for that CDR/length combination via 
a dedicated R routine. Clustering summaries were created with Java code, as well as lists 
and various post-analytical data that are detailed later. 

RESULTS 
Clustering results 

Tables of results were constructed for 58 CDR/length combinations, gathering information
that describes each individual cluster, which can be consulted for quick reference
(Tables 3-7 for CDR-L1/-L2/-L3/-H1/-H2 and a separate supplementary table for
CDR-H3, Supplemental Information 4). A summary table with all clustered lengths is
available in Table 8. Detailed membership assignments can be found in two forms: one
where every CDR is shown in alphabetical PDB order with all available clustering and
data-mined information (cis/trans peptides, structure resolution, crystal space group,
sequence, Ramachandran logos, cluster core label, bound state, light isotype, heavy or light
chain only) and one where the same information is given in cluster order (Supplemental
Information 6 and Supplemental Information 5). The ω-angle cut-off for cis-peptide
detection was set to ±30°; absence of cis-content that satisfied these limits resulted in an
all-trans (allT) label. Bound state was flagged based on a list of bound antibodies obtained
from SAbDab (Dunbar et al., 2014). This list did not contain idiotype-anti-idiotype
complexes, therefore the 5 such files in the dataset were additionally flagged as bound
(entries 1CIC, 1DVF, 1IAI, 1PG7, 3BQU).
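
The cis-peptide flagging can be illustrated with a minimal Python sketch. It assumes ω angles expressed in degrees within the range -180 to 180 and treats the ±30° limits as inclusive; both points, together with the function names and the example angles, are assumptions of this sketch. Only the allT label is taken from the text.

```python
def cis_positions(omega_angles, cutoff=30.0):
    """0-based indices of peptide bonds whose omega lies within +/- cutoff of 0."""
    # Treating the limits as inclusive is an assumption of this sketch.
    return [i for i, omega in enumerate(omega_angles) if abs(omega) <= cutoff]


def is_all_trans(omega_angles, cutoff=30.0):
    """True when no peptide bond satisfies the cis limits (the allT case)."""
    return not cis_positions(omega_angles, cutoff)


# Hypothetical omega angles (degrees) for two short loops.
loop_with_cis = [178.0, -175.2, 4.5, 179.9]
loop_all_trans = [179.1, -178.4, 176.0]
print(cis_positions(loop_with_cis))
print("allT" if is_all_trans(loop_all_trans) else "contains cis")
```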

Comparison of clustering results 

The level-1 clusters obtained in this work were compared to the clustering results
of previous major CDR studies (Tables 9-13 for CDR-L1, -L2, -L3, -H1 and -H2,
Supplemental Information 2 for CDR-H3). Specifically, comparisons were made with
the clusters found in Martin & Thornton (1996) because it was the first five-CDR clustering
performed on a significant CDR dataset (57 antibody structures, 269 CDRs), presented
most major conformational classes and for these reasons is regularly cited in research of
this kind. Comparisons were also made with the clustering results in North, Lehmann
& Dunbrack (2011) as this is the most recent relevant analysis, which used the largest
CDR dataset (932 antibody structures before filtering, 1,897 CDRs after filtering) until the
present study. Also included were the results from Kuroda et al. (2009) for the comparisons
in CDR-L3, as this recent dedicated analysis used an RMSD-based approach, as is the
case in this work, while using a considerable number of CDR structures (212 CDR-L3
structures). For the first five CDRs, the present study comprised 1,359 antibody structures
and 10,680 CDRs (and a total of 12,712 CDRs including CDR-H3). These comparisons
are discussed in the Discussion section below.
Table 8 Summation of clustered lengths per CDR. (A) Summation of clustered lengths per CDR, with population, non-redundant sequences,
number of clusters and outliers information. CDR lengths that were clustered for the first time are highlighted in bold/italics. (B) The complete
CDR-H3 conformation, using the H95-H102 extents definition, has not been extensively clustered before; therefore only lengths that were not
considered in Kuroda et al. (2009) are noted as new for conformity with the literature. CDR-H3 lengths 4 and 24 are marked with an asterisk as the
corresponding structures are also found in North, Lehmann & Dunbrack (2011), but acknowledged as 2 residues longer, due to different CDR-H3
extents (H93-H102).

(A)

CDR    Observed lengths  Total structure  Unique     Level-1   Level-1 only          Singletons/outliers
       (new lengths)     population       sequences  clusters  structure population

L1     7                 2                1          1         2                     0
L1     9                 10               4          2         10                    0
L1     10                127              28         1         126                   1
L1     11                1,042            180        4         1,033                 9
L1     12                82               26         4         81                    1
L1     13                81               26         3         81                    0
L1     14                207              25         7         193                   14
L1     15                80               34         2         32                    48
L1     16                352              74         5         319                   33
L1     17                172              36         1         171                   1
Total  10 lengths        2,155            434        30        2,048                 107

L2     7                 2,161            278        3         2,159                 2
L2     11                13               3          2         13                    0
Total  2 lengths         2,174            281        5         2,172                 2

L3     5                 10               4          1         10                    0
L3     7                 5                2          1         5                     0
L3     8                 138              43         6         136                   2
L3     9                 1,725            358        6         1,720                 5
L3     10                113              27         12        107                   6
L3     11                142              38         9         135                   7
L3     12                19               6          4         19                    0
L3     13                12               2          3         11                    1
Total  8 lengths         2,164            480        42        2,143                 21

H1     10                6                2          1         6                     0
H1     12                2                2          0         0                     2
H1     13                1,845            450        11        1,681                 164
H1     14                72               17         1         70                    2
H1     15                128              29         3         125                   3
H1     16                3                2          1         2                     1
H1     24                1                1          0         0                     1
Total  7 lengths         2,057            503        17        1,884                 173

(B)

CDR    Observed lengths  Total structure  Unique     Level-1   Level-1 only          Singletons/outliers
       (new lengths)     population       sequences  clusters  structure population

H2     8                 6                2          1         6                     0
H2     9                 436              117        6         435                   1
H2     10                1,508            381        10        1,356                 152
H2     11                3                3          0         0                     3
H2     12                171              38         4         171                   0
H2     15                6                3          2         5                     1
Total  6 lengths         2,130            544        23        1,973                 157

H3     3                 18               4          1         [illegible]           [illegible]
H3     4*                38               12         2         [illegible]           [illegible]
H3     5                 93               28         6         [illegible]           [illegible]
H3     6                 33               12         3         [illegible]           [illegible]
H3     7                 97               41         7         [illegible]           [illegible]
H3     8                 168              46         7         141                   27
H3     9                 181              55         8         132                   49
H3     10                377              98         35        [illegible]           [illegible]
H3     11                231              64         26        151                   80
H3     12                206              51         21        174                   32
H3     13                130              42         22        105                   25
H3     14                128              40         19        104                   24
H3     15                96               23         18        81                    15
H3     16                40               16         8         28                    12
H3     17                28               14         6         19                    9
H3     18                37               11         6         31                    6
H3     19                48               12         9         46                    2
H3     20                13               4          3         13                    0
H3     21                10               1          1         10                    0
H3     22                33               4          2         31                    2
H3     23                1                1          0         0                     1
H3     24*               12               2          2         12                    0
H3     25                1                1          0         0                     1
H3     28                12               2          1         12                    0
H3     31                1                1          0         0                     1
Total  25 lengths        2,032            585        213       1,620                 412

Cumulative total (all CDRs)
       58 lengths        12,712           2,827      330       11,840                872
Table 9 Comparison of level-1 conformational clusters obtained in CDR-L1 with external sets. The
cluster medoid/median or representative of the external sets was used for identification of
correspondences. In brackets, next to each correspondence, is the full, 3-level classification in this
work of the representative of the external set and the number of corresponding members in full
population comparison. Martin & Thornton (1996) cluster 14F is marked with a question mark, because
its representative (2BJL, superseded by 4BJL) actually has a 13-residue CDR-L1.

This work         Martin & Thornton, 1996             North, Lehmann & Dunbrack, 2011
[CDR-L1 cluster]  (corresponding cluster/canonical)   (corresponding cluster)
                  (level-3 of external median)        (level-3 of external median)
                  (corresponding members)             (corresponding members)

L1-7-I            -                                   -
L1-9-I            -                                   -
L1-9-II           -                                   -
L1-10-I           10A/1 (L1-10-I-1-1) (4/4)           L1-10-1 (L1-10-I-1-1) (20/20)
                                                      L1-10-2 (L1-10-I-2-2) (2/2)
L1-11-I           11A/2 (L1-11-I-2-1) (22/22)         L1-11-1 (L1-11-I-1-2) (76/76)
                                                      L1-11-2 [illegible]
L1-11-II          -                                   L1-11-3 (L1-11-II-1-2) (3/5)
L1-11-III         11B/- (L1-11-III-1-1) (1/1)         -
L1-11-IV          -                                   -
L1-12-I           -                                   L1-12-1 (L1-12-I-1-1) (5/5)
L1-12-II          -                                   L1-12-2 (L1-12-II-1-2) (4/5)
L1-12-III         -                                   -
L1-12-IV          -                                   L1-12-3 (L1-12-IV-1-2) (2/2)
L1-13-I           13A/5λ (L1-13-I-1-2) (2/2)          L1-13-1 (L1-13-I-1-2) (7/7)
                  14F/-? (L1-13-I-7-1) (1/1)
L1-13-II          -                                   L1-13-2 (L1-13-II-1-1) (4/4)
L1-13-III         -                                   -
L1-14-I           14B/7λ (L1-14-I-2-3) (3/3)          L1-14-1 (L1-14-I-1-3) (14/14)
L1-14-II          14C/- (L1-14-II-13-1) (1/1)         L1-14-2 (L1-14-II-4-1) (3/4)
                  14E/- (L1-14-II-14-1) (1/1)
L1-14-III         -                                   -
L1-14-IV          -                                   -
L1-14-V           14A/6λ (L1-14-V-1-2) (1/1)          -
L1-14-VI          -                                   -
L1-14-VII         -                                   -
L1-15-I           -                                   L1-15-1 (L1-15-I-1-11) (8/11)
L1-15-II          -                                   -
L1-16-I           16A/4 (L1-16-I-1-51) (8/9)          L1-16-1 (L1-16-I-1-1) (62/68)
                  16C/- (L1-16-I-1-20) (1/1)
L1-16-II          -                                   -
L1-16-III         -                                   -
L1-16-IV          -                                   -
L1-16-V           -                                   -
L1-17-I           17A/3 (L1-17-I-1-17) (4/4)          L1-17-1 (L1-17-I-1-3) (21/21)

Outliers
L1-12-O           12A/6 (L1-12-O-1-1) (1/1)           -
L1-14-O           14D/- (L1-14-O-3-1) (1/1)           -
L1-15-O           15A/5 (L1-15-O-6-1) (1/1)           L1-15-2 (L1-15-O-3-1) (2/2)
                  15B/- (L1-15-O-1-4) (2/2)
L1-16-O           16B/- (L1-16-O-[illegible]-1) (2/2) -






Table 10 Comparison of level-1 conformational clusters obtained in CDR-L2 with external sets. See
notes in Table 9. In North, Lehmann & Dunbrack (2011), the CDR extents were defined as L49-L56,
instead of L50-L56; hence a direct comparison is not possible. Nonetheless, since position L49 is fairly
conserved structurally and for reference reasons, a correspondence of the longer by 1 residue clusters is
shown, based on the representative of those clusters (in square brackets and in full-italics).

This work         Martin & Thornton, 1996             North, Lehmann & Dunbrack, 2011
[CDR-L2 cluster]  (corresponding cluster/canonical)   (corresponding cluster)
                  (level-3 of external median)        (level-3 of external median)
                  (corresponding members)             (corresponding members)

L2-7-I            7A/1 (L2-7-I-2-1) (55/55)           [L2-8-1 (L2-7-I-2-1) (290/290)
                                                      L2-8-2 (L2-7-I-6-2) (9/9)
                                                      L2-8-4 (L2-7-I-10-1) (2/2)
                                                      L2-8-5 (L2-7-I-14-2) (2/2)]
L2-7-II           -                                   [L2-8-3 (L2-7-II-1-2) (3/3)]
L2-7-III          7B/1 (L2-7-III-1-6) (1/1)           -
L2-11-I           -                                   [L2-12-2 (L2-11-I-1-1) (2/2)]
L2-11-II          -                                   [L2-12-1 (L2-11-II-2-1) (2/2)]

Rogue clusters and sequences 



Any two conformational clusters that contain one or more members with identical CDR
sequences were designated 'rogue'. This definition was first used for CDR conformations by
Martin & Thornton (1996) with respect to their unpredictability by canonical sequence
templates when all their key residues are overlapping. In this work the notion is expanded
with the term 'rogue CDR sequences', which refers specifically to those identical sequences
that are found to exist in more than one distinct conformation.
The extraction of such sequences allows for further investigation, which can reveal any
particular circumstances or neighbouring sequence features that led to a different CDR
conformation despite the identical sequence. For example, examination of antibody Fvs
with rogue CDR sequences may reveal the influence of neighbouring main-chain atoms, a
particular framework residue influencing the CDR conformation, a conformational switch
due to interface interactions (e.g., with an antigen), intrusive crystal-packing interactions,
or even suggest some experimental error.
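
The extraction of rogue CDR sequences described above amounts to grouping cluster assignments by sequence. The following is a minimal sketch in Python, not the code used in this study; the function name and the input layout (pairs of sequence and cluster label for one CDR/length set) are assumptions made for illustration.

```python
from collections import defaultdict


def rogue_sequences(assignments):
    """Return sequences that occur in more than one conformational cluster.

    assignments: iterable of (cdr_sequence, cluster_label) pairs,
                 all taken from the same CDR/length set.
    """
    clusters_by_sequence = defaultdict(set)
    for sequence, cluster in assignments:
        clusters_by_sequence[sequence].add(cluster)
    # A sequence is 'rogue' when identical copies sit in different clusters.
    return {
        sequence: sorted(clusters)
        for sequence, clusters in clusters_by_sequence.items()
        if len(clusters) > 1
    }
```

Any pair of clusters listed together for one sequence would then be a pair of rogue clusters in the sense defined above.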
Table 11 Comparison of level-1 conformational clusters obtained in CDR-L3, with external sets. See notes in Table 9. In Kuroda et al. (2009),
no cluster representatives are available, so the cluster member with the best resolution was arbitrarily selected in each case, in order to identify the
correspondences with the results from the present study.

This work         Martin & Thornton, 1996             North, Lehmann & Dunbrack, 2011       Kuroda et al., 2009
[CDR-L3 cluster]  (corresponding cluster/canonical)   (corresponding cluster)               (corresponding cluster) (representative)
                  (level-3 of external median)        (level-3 of external median)          (level-3 of external representative)
                  (corresponding members)             (corresponding members)               (corresponding members)

L3-5-I            -                                   -                                     -
L3-7-I            7A/4 (L3-7-I-1-2) (1/1)             L3-7-1 (L3-7-I-1-2) (2/2)             4 (1MIM) (L3-7-I-1-1) (1/1)
L3-8-I            8B/- (L3-8-I-1-1) (1/1)             L3-8-1 (L3-8-I-1-1) (14/15)           3B (1PZ5) (L3-8-I-2-1) (4/4)
                                                                                            6 (1Q9W) (L3-8-I-1-1) (6/6)
L3-8-II           -                                   L3-8-cis6-1 (L3-8-II-2-1) (3/3)       7 (2FAT) (L3-8-II-2-1) (2/2)
L3-8-III          8A/3 (L3-8-III-1-1) (1/1)           L3-8-2 (L3-8-III-2-1) (3/4)           3A (1YQV) (L3-8-III-1-1) (2/2)
L3-8-IV           -                                   -                                     -
L3-8-V            -                                   -                                     -
L3-8-VI           -                                   -                                     -
L3-9-I            9A/1 (L3-9-I-1-1) (40/40)           L3-9-cis7-1 (L3-9-I-1-1) (219/219)    1 (1MJU) (L3-9-I-1-2) (159/161)
                                                      L3-9-2 (L3-9-I-9-1) (12/12)
                                                      L3-9-cis7-2 (L3-9-I-15-2) (8/8)
                                                      L3-9-cis7-3 (L3-9-I-12-4) (2/2)
L3-9-II           9C/4λ (L3-9-II-1-8) (2/2)           L3-9-1 (L3-9-II-2-1) (17/22)          1A (1A6V) (L3-9-II-1-4) (5/5)
                  9D/- (L3-9-II-1-4) (2/2)                                                  1B (7FAB) (L3-9-II-1-8) (1/1)
                  9E/1 (L3-9-II-5-1) (1/1)                                                  1C (1Q0X) (L3-9-II-2-2) (2/2)
L3-9-III          9B/2 (L3-9-III-1-1) (1/1)           L3-9-cis6-1 (L3-9-III-1-1) (1/1)      (9-)2 (2FBJ) (L3-9-III-1-1) (1/1)
                  9F/- (L3-9-III-7-1) (1/1)
L3-9-IV           -                                   -                                     -
L3-9-V            -                                   -                                     -
L3-9-VI           -                                   -                                     -
L3-10-I           -                                   -                                     -
L3-10-II          -                                   -                                     -
L3-10-III         -                                   L3-10-1 (L3-10-III-1-2) (2/6)         -
L3-10-IV          -                                   L3-10-cis7,8-1 (L3-10-IV-1-2) (1/1)   5 (1JGU) (L3-10-IV-1-2) (1/1)
L3-10-V           -                                   -                                     -
L3-10-VI          -                                   -                                     -
L3-10-VII         10B/- (L3-10-VII-3-1) (1/1)         -                                     -
L3-10-VIII        -                                   -                                     -
L3-10-IX          -                                   -                                     -
L3-10-X           -                                   -                                     -
L3-10-XI          -                                   L3-10-cis8-1 (L3-10-XI-1-2) (1/2)     -
L3-10-XII         10C/- (L3-10-XII-3-1) (1/1)         -                                     -
                  10D/- (L3-10-XII-8-1) (1/1)
L3-11-I           11A/5λ (L3-11-I-1-1) (2/2)          L3-11-1 (L3-11-I-1-2) (8/9)           (11-)2 (2FB4) (L3-11-I-1-1) (3/5)
L3-11-II          -                                   L3-11-cis7-1 (L3-11-II-1-2) (1/1)     8 (2NY1) (L3-11-II-1-2) (1/1)
L3-11-III         -                                   -                                     -
L3-11-IV          -                                   -                                     -
L3-11-V           11B/- (L3-11-V-1-1) (1/1)           -                                     -
L3-11-VI          -                                   -                                     -
L3-11-VII         -                                   -                                     -
L3-11-VIII        -                                   -                                     -
L3-11-IX          -                                   -                                     -
L3-12-I           -                                   -                                     -
L3-12-II          -                                   L3-12-1 (L3-12-II-1-1) (1/1)          -
L3-12-III         -                                   -                                     -
L3-12-IV          -                                   -                                     -
L3-13-I           -                                   L3-13-1 (L3-13-I-1-1) (1/3)           -
L3-13-II          -                                   -                                     -
L3-13-III         -                                   -                                     -

Outliers
L3-10-O           10A/5 (L3-10-O-6-1) (1/1)           -                                     -







All cluster populations were parsed for rogue CDR sequences and a list of CDRs, 
tagged by their cluster assignment, was created for future detailed analysis (Supplemental 
Information 1). Also in the same file, entries with completely identical Fvs which belong 
to different conformational clusters (full-chain rogues) are reported separately, while 
entries containing bound antibodies are flagged as such by an asterisk. Furthermore, 
cluster populations were compared in all CDR/length sets, and the minimum number of 
amino acid differences, position-by-position, was calculated between any two sequences of 
different clusters. This difference was termed the 'minimum pairwise Sequence Distance 
between clusters', or mSD (essentially a minimum Hamming distance between sequences). 
Matrices showing the mSD between all clusters were constructed for every CDR/length, 
and heatmaps were produced in order to allow a quick visual appreciation of the degree 
of sequence dissimilarity between clusters (Supplemental Information 3). The purpose 
of these heatmaps is to assist mutation studies by promptly directing the researcher to 
clusters/CDR sequences of interest, as well as sequence-to-structure studies by biologists or 
modellers. 
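
The mSD defined above can be sketched in a few lines. This is an illustrative Python version under stated assumptions, not the code used to build Supplemental Information 3: the function names are hypothetical, and each cluster is represented simply as a list of equal-length CDR sequences from one CDR/length set.

```python
def hamming(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def msd(cluster_a, cluster_b):
    """Minimum pairwise sequence distance between two clusters."""
    return min(hamming(a, b) for a in cluster_a for b in cluster_b)


def msd_matrix(clusters):
    """mSD for every unordered pair of clusters.

    clusters: dict mapping a cluster label to its list of sequences.
    """
    labels = sorted(clusters)
    return {
        (p, q): msd(clusters[p], clusters[q])
        for i, p in enumerate(labels)
        for q in labels[i + 1:]
    }
```

Under this definition an mSD of 0 between two clusters indicates that they share at least one identical sequence, i.e., a rogue CDR sequence.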
Table 12 Comparison of level-1 conformational clusters obtained in CDR-H1 with external sets. See
notes in Table 9. In Martin & Thornton (1996), the CDR extents definition is significantly different
(H26-H35), but correspondences based on median structures are shown for reference (in square brackets
and full-italics).

This work         Martin & Thornton, 1996             North, Lehmann & Dunbrack, 2011
[CDR-H1 cluster]  (corresponding cluster/canonical)   (corresponding cluster)
                  (level-3 of external median)        (level-3 of external median)
                  (corresponding members)             (corresponding members)

H1-10-I           -                                   H1-10-1 (H1-10-I-1-7) (7/7)
H1-13-I           [10A/1 (H1-13-I-1-7) (41/44)]       H1-13-1 (H1-13-I-1-1) (761/767)
                                                      H1-13-2 (H1-13-I-13-4) (2/7)
                                                      H1-13-4 (H1-13-I-2-19) (3/4)
                                                      H1-13-7 (H1-13-I-8-4) (3/3)
H1-13-II          -                                   H1-13-8 (H1-13-II-4-1) (2/3)
H1-13-III         -                                   H1-13-6 (H1-13-III-1-2) (2/4)
                                                      H1-13-cis9-1 (H1-13-III-2-4) (2/2)
H1-13-IV          -                                   -
H1-13-V           -                                   -
H1-13-VI          -                                   -
H1-13-VII         -                                   -
H1-13-VIII        -                                   H1-13-5 (H1-13-VIII-1-5) (4/4)
H1-13-IX          -                                   -
H1-13-X           -                                   -
H1-13-XI          -                                   -
H1-14-I           [11A/2 (H1-14-I-11-1) (1/1)]        H1-14-1 (H1-14-I-3-11) (11/11)
H1-15-I           [12A/3 (H1-15-I-2-7) (1/1)]         H1-15-1 (H1-15-I-2-3) (9/9)
H1-15-II          -                                   -
H1-15-III         -                                   -
H1-16-I           -                                   -

Outliers
H1-12-O           -                                   H1-12-1 (H1-12-O-1-1) (1/1)
H1-13-O           [10B/1 (H1-13-O-66-1) (1/1)         H1-13-3 (H1-13-O-14-1) (5/5)
                  10C/1 (H1-13-O-20-3) (1/1)          H1-13-9 (H1-13-O-57-1) (1/3)
                  10D (H1-13-O-31-1) (1/1)]           H1-13-10 (H1-13-O-34-1) (2/2)
                                                      H1-13-11 (H1-13-O-56-1) (1/2)
H1-16-O           -                                   H1-16-1 (H1-16-O-1-1) (1/1)
H1-24-O           -                                   -


Investigation of structure resolution in outlier space 

As a preliminary layer of quality assessment for the outliers in the present clustering,
the min, max, average and median resolutions were calculated in clustered and outlier
spaces per CDR/length (-L1, -L3, -H1, -H2, being of the highest interest). These values
were plotted as stock charts for comparison, in order to observe any global correlation
between the outlier space content and possibly erroneous CDR structures due to poor
resolution (Supplemental Information 7). In only four cases (CDR-H1/15-, CDR-H1/16-,
CDR-L1/12- and CDR-L1/16-residues) was the median resolution of outlier space found
to be more than 0.5 Å higher than the respective median in clustered space, and in only
two cases (CDR-H1/15-residues, 3 outliers in total, and CDR-L1/12-residues, 1 outlier in
total) was the outlier median resolution value above 2.8 Å. In conclusion, average structure
resolution does not appear to be a determinant factor of the outlier content, although
it remains possible that wrong structures due to poor resolution may exist among the
outliers. In fact, as proposed throughout this work, any decisions on structure validity
should be considerably easier to make during targeted analysis of the structures/clusters
of interest, when using the results of the present clustering. The supplementary file
(Supplemental Information 7) also contains complementary bar charts showing the
percentages of bound content in outlier and clustered space.

Table 13 Comparison of level-1 conformational clusters obtained in CDR-H2, with external sets. See
notes in Table 9.

This work         Martin & Thornton, 1996             North, Lehmann & Dunbrack, 2011
[CDR-H2 cluster]  (corresponding cluster/canonical)   (corresponding cluster)
                  (level-3 of external median)        (level-3 of external median)
                  (corresponding members)             (corresponding members)

H2-8-I            -                                   -
H2-9-I            9A/1 (H2-9-I-1-1) (8/8)             H2-9-1 (H2-9-I-1-1) (76/77)
                                                      H2-9-3 (H2-9-I-3-2) (2/2)
H2-9-II           -                                   H2-9-2 (H2-9-II-1-2) (2/2)
H2-9-III          -                                   -
H2-9-IV           -                                   -
H2-9-V            -                                   -
H2-9-VI           -                                   -
H2-10-I           10A/2 (H2-10-I-1-6) (17/21)         H2-10-1 (H2-10-I-1-3) (151/155)
H2-10-II          10B/3 (H2-10-II-1-4) (11/11)        H2-10-2 (H2-10-II-1-1) (40/42)
                                                      H2-10-4 (H2-10-II-4-1) (7/7)
                                                      H2-10-5 (H2-10-II-3-1) (3/3)
H2-10-III         -                                   -
H2-10-IV          -                                   -
H2-10-V           -                                   -
H2-10-VI          -                                   -
H2-10-VII         -                                   -
H2-10-VIII        -                                   -
H2-10-IX          -                                   -
H2-10-X           -                                   -
H2-12-I           12A/4 (H2-12-I-5-1) (2/2)           H2-12-1 (H2-12-I-1-1) (26/26)
                  12B/4 (H2-12-I-1-11) (2/2)
H2-12-II          -                                   -
H2-12-III         -                                   -
H2-12-IV          -                                   -
H2-15-I           -                                   H2-15-1 (H2-15-I-1-1) (1/1)
H2-15-II          -                                   -

Outliers
H2-10-O           10C/3 (H2-10-O-20-1) (2/2)          H2-10-3 (H2-10-O-3-10) (10/11)
                  10D/2 (H2-10-O-36-1) (1/1)          H2-10-7 (H2-10-O-20-1) (2/2)
                  10E/2 (H2-10-O-34-1) (1/1)          H2-10-8 (H2-10-O-13-1) (1/2)
                  10F/2 (H2-10-O-11-2) (1/1)          H2-10-9 (H2-10-O-29-3) (2/2)

DISCUSSION 

The early approach to CDR conformational classification defined a strict threshold of
similarity for clusters, beyond which any new conformation becomes the first member of a new
class/cluster. As the number of new antibody structures increased almost exponentially
in the past decades, the definition of a strict similarity threshold became problematic as
many conformational variants of known classes appeared in the similarity-criterion space
between different clusters. An obvious solution to this new and complex data structure
was the pre-exclusion of all structures with characteristics that could potentially point to
wrong conformations, or essentially be characterised as "noise" in the data. For instance, in
the latest CDR clustering (North, Lehmann & Dunbrack, 2011), the data was considerably
simplified by removing structures based on several filtering criteria: crystal resolution;
high or non-reported CDR backbone B-factors; presence of cis-peptide bonds for residues
other than proline; highly improbable backbone conformations; and loops with very
high conformational energies. In the present study, however, the goal was to obtain
a classification for every available CDR, so any "data noise" had to be handled by the
clustering methodology.

The primary characteristic of the CDR clustering performed in this study is that the
main, or level-1, clusters do not carry a pre-defined degree of conformational similarity.
This would require the strict definition of a threshold in the RMSD distance on all
Cα-atoms from the cluster's medoid, or as a maximum cluster diameter (e.g., Martin &
Thornton, 1996; Kuroda et al., 2009). Alternatively, in North, Lehmann & Dunbrack (2011),
a dihedral angle-based distance measure was used in order to define a threshold for cluster
merging (65° between each dihedral pair), while the main clustering method (an affinity
algorithm) practically produced a final result that is roughly equivalent or close to the
level-2 clustering in this study (clustering by r-RL). In contrast, in this study level-1 clusters
were formed with no use of discrete distance thresholds whatsoever, but instead based on
the greater affinity of each object towards its assigned cluster as expressed by the all-positive
SWs, while the average SC ensured a typically textbook-defined, reasonable or better global
partition of clusters (SC > 0.51).
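
The acceptance test on a candidate partition can be illustrated with a short sketch. It assumes that SW and SC denote the standard silhouette width of each object and its average over all objects, and it is written in Python for illustration only; the function names and input layout are hypothetical and do not reproduce the routines used in this study.

```python
def silhouette_widths(labels, dist):
    """Standard silhouette width for every object.

    labels: dict mapping object id to cluster label (at least two clusters).
    dist:   function dist(a, b) returning the distance (e.g. RMSD) of a pair.
    """
    clusters = {}
    for obj, label in labels.items():
        clusters.setdefault(label, []).append(obj)
    widths = {}
    for obj, label in labels.items():
        own = [o for o in clusters[label] if o != obj]
        if not own:
            widths[obj] = 0.0  # usual convention for a one-member cluster
            continue
        # a: mean distance to the object's own cluster.
        a = sum(dist(obj, o) for o in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist(obj, o) for o in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths[obj] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return widths


def partition_ok(widths, sc_min=0.51):
    """True when every width is positive and the average exceeds sc_min."""
    average = sum(widths.values()) / len(widths)
    return all(w > 0 for w in widths.values()) and average > sc_min
```

In this sketch an object with a non-positive width would simply cause the partition to be rejected; how such objects are moved to outlier space is part of the procedure described in the Methods and is not modelled here.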

This approach was selected for two reasons: (1) in order to reduce the subjectivity
that is inherent in every threshold definition and clustering decision in general,
and (2) in order to allow the adherence of conformational variants to their most
apparent closest conformational theme. This in turn may reveal the natural flexibility
in physiological conditions, or structural mechanisms and synergies that are specific
to an antibody's function. Indeed, it becomes more straightforward to comparatively
examine the reason for a conformational variant when it is found connected to its closest
conformational theme, rather than when treated as a completely distinct conformation
or as an outlier/singleton. This is also the most important difference between the present
antibody CDR clustering analysis and the clustering by UPGMA offered by the recently
released CDR structural database SAbDab (Dunbar et al., 2014).

The clustering algorithm employed in this study offered simultaneous flexibility in 
selecting the most appropriate pruning parameters, and in-depth description of clusters by 
its definition of cluster core objects. Researchers wishing to retrieve the most representative 
objects (the most tightly represented conformation) of each cluster may select any one of 
the cluster's core CDRs (tagged as such in the clustering results listings). Furthermore, the 
presentation of each cluster's extremities in the results (most distant members forming the 
cluster's diameter), allows the rapid assessment of the extents of conformational variability 
of the cluster so that researchers can make informed decisions as to the importance of any 
observed deviations of their target structure with regard to the overall conformational 
characteristics of the cluster. 

In practice over 80% of the clustering was straightforward in establishing a partition 
with an SC > 0.51, all positive individual SW, the highest number of clusters possible 
with close-to-ideal maximum diameters and the lowest number of outliers. In fact, the 
formalisation of the complete procedure contains few subjective features, namely those 
of the ideal maximum cluster diameter index and of the overall stringency in examining 
all possible outcomes (average and complete hierarchical trees, 2nd-stage PAM). In the 
first case, the index had a merely suggestive role in triggering the assessment of a possible 
cluster splitting strategy, while in the second case the optional PAM stage or one of the 
two hierarchical methods may be completely omitted, especially if an acceptable result is 
already obtained. Therefore, this clustering method can be entirely machine-coded and 
carried out in a fully automated way, if required. 

The major challenge in this clustering was brought by the initial decision to include all
the available antibody structures as of the 31st December 2011 edition of PDB, in order
to create a complete CDR conformational repertoire. While this decision allowed a richer
result, for all the reasons and possible advantages detailed earlier, it was accepted
that noise was added to the dataset by the inclusion of a number of potentially erroneous
structures. The usual strategy followed in such cases is data re-sampling, or bootstrapping,
in order to assess the effects and influence of noise on the dataset configuration by some
estimator (e.g., percentiles, medians, variance, etc.) and to attempt projections for the
evolution of partitions in the future. There was reluctance in pursuing such a methodology
in this case, mainly because the appearance of new antibody structures in the PDB follows
a constantly varying scientific interest for diseases, therapeutics and basic research, and as
such the obtained dataset cannot be considered representative of some random process. In
this sense it is anecdotal that a few months before the closure of the dataset, a considerable
number of anti-HIV and anti-flu antibody structures (33/128 structures released in 2011,
i.e., ~26%), all with very characteristic CDR conformations, had emerged in the PDB
following the research trend for that period.

The solution to noisy data was the efficient exclusion of outliers/singletons from clusters,
coupled with the nested architecture of the final clustering result. The efficient exclusion
was ensured by the requirement that clusters form a tight core while all cluster objects
present an individual positive SW with respect to the global cluster partition. Though it
was still possible that a few, very small 2- or 3-member clusters failed to form due to the
positive SW requirement, the subsequent 2nd- and 3rd-level qualitative clustering, based
on Ramachandran Logos, would create a common conformational tag to allow recognition
and classification of even such small outlying groups. Daughter-level sub-clusters mainly
provide a means to identify all the members of important or subtle conformational
variants of the parental theme, and by that fact offer more common examples for the
researcher to compare their CDR with. Finally, it remains the individual researcher's
decision as to which CDR conformations are useful, important, or potentially wrong.
However, when consulting the clustering results of this study, the data is classified in such
a way, and with no loss of information due to pre-filtering, that the researcher has at their
disposal all the necessary information to help them take that decision.

As a means of external validation, it is important to observe the comparison and relation
of conformational CDR clusters between this and the major previous studies. As far as
the first five CDRs are concerned, in many cases clusters from previous work were found
to correspond to level-1 clusters from this study on a one-to-one basis (36/72 compared
clusters from North, Lehmann & Dunbrack (2011), 21/49 compared clusters from Martin &
Thornton (1996), 8/13 compared clusters from Kuroda et al. (2009)), while in several cases
more than one cluster from those external sets was found to correspond to the same level-1
cluster (correspondingly for the aforementioned studies: 25/72 clusters contained in 9
level-1 clusters, 15/49 clusters contained in 7 level-1 clusters, and 5/13 clusters contained in
2 level-1 clusters). This is characteristic of the different clustering strategies adopted in each
study, as the external sets imposed discrete similarity thresholds on their cluster definition,
but also of the smaller number of structures in their datasets which allowed for a sharper,
more specific clustering when the data configuration was favourable. In all those cases,
the external clusters are still distinct in the present clustering result, as they almost always
correspond to different level-2 clusters from this study. In only two cases (clusters 16A/16C
in CDR-L1 from Martin & Thornton (1996), and clusters 1A/1B in CDR-L3 from Kuroda
et al. (2009)) were external clusters differentiated only at the 3rd-level, meaning that the
full, 3-level conformational logo is required to describe them. Finally, in several cases
small 2-, or 3-member external clusters, or mere singletons, were found to correspond to
outliers in this study (11/72 in North, Lehmann & Dunbrack (2011), 13/49 in Martin &
Thornton (1996)), because of the specific requirements for the existence of a tight core and
all positive individual SW, as explained previously. Even so, these small external clusters
are still distinct in the present result as their members are regrouped at the 2nd-level of
clustering. The additional full population analysis of cluster assignments between this
study and previous work showed consistency of membership correspondences, at 98%
(262/268) for Martin & Thornton (1996), at 97% (1,534/1,589) for North, Lehmann &
Dunbrack (2011) and at 98% (188/192) for Kuroda et al. (2009). Most of the observed
discrepancies concerned outlying conformations (6/6, 32/55 and 2/4, correspondingly for
the aforementioned works). In comparison, the present clustering analysis revealed 117
level-1 clusters in the first five CDRs, 66 of which have no correspondences and are novel.
This is due to the larger dataset and to the lack of data pre-filtering.

In CDR-H3, full population correspondences with North, Lehmann & Dunbrack 
(2011) were expectedly poor (56%, 171/307). This is explained by the much larger 
number of clustered structures (2,032 versus 307) and the different strategy employed 
in level- 1 cluster formation, but also to some extent, by the discrepancy of 2 residues 
in the respective CDR-H3 definitions. Indeed, the inclusion of all available CDR-H3 
loops in the present clustering procedure allowed an even clearer appreciation of their 
pronounced conformational hypervariability: 25 H3 lengths, 213 clusters, most of which 
are in fact singletons that technically acquired the status of a 'cluster', because they were 
represented by more than one structure in the initial dataset. In fact, only 53/213 clusters 
were populated by more than 1 unique CDR sequence; while a revealing total of another 
412/2,032 structures were left as outliers/singletons. In this landscape of variability in 
conformation, sequence and length, the adopted level-1 clustering methodology does not 
expand a cluster's radius towards closely-related conformations, but instead restricts that 
radius appropriately, excluding structures that both fail to form a well-separated core and 
do not clearly belong to one cluster rather than another. However, these outlying structures 
are still further classified based on their Ramachandran logos, whenever possible (i.e., at 
level-2 and -3 of the classification scheme). 
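
The exclusion of structures that do not clearly belong to one cluster rather than another can be illustrated with the 'all positive individual SW' requirement mentioned earlier. The sketch below assumes that SW denotes the standard silhouette width and works on a toy set of one-dimensional positions standing in for pairwise RMSD values. The data, labels and function name are hypothetical. Neither the tight-core requirement nor the tree-cut procedure of this study is reproduced here.

```python
def silhouette_widths(dist, labels):
    """Return the individual silhouette width of every item.

    dist   -- symmetric matrix (list of lists) of pairwise distances
    labels -- cluster label of each item, in the same order as dist
    Assumes at least two distinct clusters are present.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # convention: a singleton has width 0
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return widths


# Hypothetical toy data: positions on a line replace real RMSD values.
positions = [0.0, 0.2, 0.4, 1.8, 3.0, 3.2]
labels = ["A", "A", "A", "A", "B", "B"]
dist = [[abs(p - q) for q in positions] for p in positions]

widths = silhouette_widths(dist, labels)
# Items whose width is not strictly positive are set aside as outliers.
outliers = [i for i, w in enumerate(widths) if w <= 0]
print(outliers)
```

In this toy example only the item at position 1.8 is set aside. It lies closer on average to cluster B than to its assigned cluster A, so keeping it would enlarge the radius of A without adding a clearly attributable member.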

All these observations are suggestive of the advantages brought by the multi-level 
clustering structure, as nearly all identified external clusters are distinct at the 2nd-level 
of our clustering (mainly in the first five CDRs), with the 1st-level expanding towards 
closely-related conformational variants when possible, while efficiently excluding outliers. 
3rd-level clusters procure even deeper specificity when required. It becomes apparent that 
the trade-off between conformational specificity and sensitivity is locked in the clusters of 
previous studies based on the existence of a strict, but subjective, formation threshold. In 
contrast, the present clustering result produced a more adaptable framework, where the 
sensitivity and specificity of conformational similarity are more intuitively distributed in 
its three different levels. As an example of the conformational variability between level-1 
clusters in this study and North, Lehmann & Dunbrack (2011), a comparative view of all 
detected clusters in CDR-H1 13-residues (displaying a rich cluster repertoire) superposed 
on those from North, Lehmann & Dunbrack (2011) where applicable, is presented in Fig. 7. 

The description and commenting of each CDR/length combination obtained in this 
study may be of small value at this point, firstly due to the massive volume of the data 
involved, but mainly because the detailed examination of each cluster could warrant a 
separate, dedicated study in its own right (something that the present study aims to assist 
and encourage). Nonetheless, it is interesting to observe that in almost all CDR/length 
combinations with substantial content in unique CDR sequences (i.e., more than 10 
unique sequences) there is usually a single cluster which regroups the large majority of 
the available known conformations, while the remaining fraction may be populating 
a considerable number of much smaller clusters. In the 15 lengths (first 5 CDRs) that 
contained more than 10 unique sequences in their clustered population and produced 
more than one cluster, the major cluster of each length represented on average 74% of the 
available unique sequences (median: 86%). The case of H2/10-residues is the one exception 
with two well-populated clusters (H2-10-I and -II) with an approximate 1:2.5 ratio in 
non-redundant members. L3/10-residues are the only other exception where no major 
cluster is observed despite the considerable number of available unique sequences. 

Figure 7 A comparative view of all CDR-H1/13 residue clusters obtained in this work (in yellow), superposed on their correspondences from North, 
Lehmann & Dunbrack (2011), where applicable. Level-1 clusters from this work expand whenever possible towards closely-related variants, which 
are then further classified at levels 2 and 3 (complete 3-level classification in this work of the external median is given in brackets). This can be 
appreciated in clusters H1-13-I and H1-13-III from this work. The last four structures of this figure correspond to cluster medians from North, 
Lehmann & Dunbrack (2011) that were classified as outliers/singletons in this work. 
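
The gap between the mean (74%) and the median (86%) of the major-cluster share is what would be expected when a few lengths lack a dominant cluster. The sketch below shows how such a summary can be computed. The counts and the length names are entirely hypothetical and are not taken from the present dataset.

```python
from statistics import mean, median

# Hypothetical numbers of unique CDR sequences per cluster, for three
# made-up CDR/length combinations (not data from this study).
unique_sequences_per_cluster = {
    "length_A": [90, 10],
    "length_B": [43, 7],
    "length_C": [20, 15, 15],  # no dominant cluster
}


def major_cluster_share(counts):
    """Fraction of unique sequences that fall in the largest cluster."""
    return max(counts) / sum(counts)


shares = {
    name: major_cluster_share(counts)
    for name, counts in unique_sequences_per_cluster.items()
}
mean_share = mean(shares.values())
median_share = median(shares.values())

print(f"mean share: {mean_share:.2f}, median share: {median_share:.2f}")
```

With these invented counts the single length without a dominant cluster pulls the mean (0.72) below the median (0.86), which mirrors the direction of the difference reported above.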

Given the considerable volume of structural data included in the work, the above fact 
could suggest that, in contrast to the original observation that CDRs adopt one of 
a limited number of possible conformations in L1, L2, L3, H1 and H2, in fact three out 
of four CDR sequences seem to result in variants of the prominent conformation for 
that CDR length. To take this matter even further and based on the respective median, 
it can also be inferred that in half the well-populated CDR lengths, a variant of the 
prominent conformational theme is adopted by close to nine out of ten CDR sequences. 
Furthermore, the animal sources of CDR members of these major clusters are sufficiently 
varied to suggest that the respective conformations are ubiquitously maintained. 
These observations combined highlight the importance of subtle conformational 
variations in antigen recognition and, therefore, of the detailed repertoire provided at 
levels 2 and 3 of the present clustering analysis (e.g., by rogue analysis at the daughter 
cluster level). In contrast, the hypervariable (in length, sequence and conformation) 
CDR-H3 appears as the loop that consistently confers the most pronounced layer of 
conformational variation in the antibody binding interface. 

It is known from experience with humanised antibodies (Saldanha, 2009) that the 
conservation of residues which maintain the conformation of the CDR in the designed 
sequence often leads to binding versions and vice versa. Further investigation of the 
clusters, particularly at levels 2 and 3, for these residues will enhance the modelling and 
design of humanised sequences by recognition, within the variants, of subtle differences to 
the main conformational theme. 

CONCLUSION 

By producing a classified snapshot of the entirety of the CDR conformations in the 
PDB, the aim was to present the experimentally known repertoire in a way that also 
allows inferences on the relationship between conformations. The latter exist as the 
result of backbone flexibilities, induced-fit, local sequence causing subtle variants, or 
even erroneous experimental data. Consequently, any conclusions on the quality or 
truthfulness of a structure can be drawn by the aid of this classification, instead of 
arbitrarily discarding all dubious cases from the very beginning. The dedicated analysis 
of structures belonging to different clusters, despite having the same CDR or even complete 
Fv sequence, could prove helpful towards this end. Therefore, the present clustering study 
can be viewed as a necessary 'logistical task', where no information is lost, whose value is 
best described by the possibilities it offers for a range of future specialised analyses, rather 
than a 'one-stop' study that allows derivation of final conclusions on the available CDR 
conformations. The results provided here include richly annotated cluster summaries and 
cluster memberships, a three-level classification, detailed comparisons with previously 
established CDR conformational clusters, lists of rogue CDR sequences and minimum 
Sequence Distance heatmaps. 

The focus of this study was to produce a complete repertoire of available CDRs, with 
multi-level clusters that allow the user to select the desired conformational specificity or 
sensitivity, but also with an increased potential for predictability from sequence. As a piece 
of subsequent work based on the present clustering results, a comparative assessment of 
methods for predicting CDR conformation from sequence (canonical templates, sequence 
rules and a new method named Disjoint Combinations Profiling (DCP)) was carried out 
by the same group (Nikoloudis, Pitts & Saldanha, 2014), with very encouraging results. An 
implication that could be attributed to those results, considering that no clustered data 
was discarded, is that the present clustered set was conformationally meaningful at its 
level-1 instance, despite the designed tendency of clusters to expand towards potential 
variants of the main conformational theme. This is based on the fact that using this 
clustered set for training/updating produced DCP models achieving a range of 90%-99% 
cumulative accuracy on predictable conformations of the new dataset (CDR-L1, -L3, 
-H1, -H2, -H3-base), while canonical templates achieved 91% and 94% in CDR-L1 and 
CDR-L3, respectively. Therefore, the clustering goal of presenting a complete repertoire 
of conformational families could be considered successful as the most related backbone 
variations were attributed correctly to the most appropriate class. This clearly did not 
negatively influence class identification from sequence and possibly even enhanced it. 
Additionally, this companion article also includes a visual analysis of CDR structures that 
fall into different conformational classes despite being present in identical Fv sequences. 

In conclusion, an accurate CDR classification is presented with novel characteristics, 
richly annotated and post-analysed clustered data, and also compared with previous work. 
In all cases, it is believed that the present analysis fills a gap in antibody CDR studies, by 
creating links between all related prior knowledge, while proposing new directions for 
future research. 